labelSeg: segment annotation for tumor copy number alteration profiles

Abstract Somatic copy number alterations (SCNAs) are a predominant type of oncogenomic alterations that affect a large proportion of the genome in the majority of cancer samples. Current technologies allow high-throughput measurement of such copy number aberrations, generating results consisting of frequently large sets of SCNA segments. However, the automated annotation and integration of such data are particularly challenging because the measured signals reflect biased, relative copy number ratios. In this study, we introduce labelSeg, an algorithm designed for rapid and accurate annotation of CNA segments, with the aim of enhancing the interpretation of tumor SCNA profiles. Leveraging density-based clustering and exploiting the length–amplitude relationships of SCNA, our algorithm proficiently identifies distinct relative copy number states from individual segment profiles. Its compatibility with most CNA measurement platforms makes it suitable for large-scale integrative data analysis. We confirmed its performance on both simulated and sample-derived data from The Cancer Genome Atlas reference dataset, and we demonstrated its utility in integrating heterogeneous segment profiles from different data sources and measurement platforms. Our comparative and integrative analysis revealed common SCNA patterns in cancer and protein-coding genes with a strong correlation between SCNA and messenger RNA expression, promoting the investigation into the role of SCNA in cancer development.

In scenarios where the cluster standard deviation is either unavailable (owing to identical logR values within the target low-level cluster) or deemed too small (   −  < .
When employing HDBSCAN clustering, the adjustment for small standard variation is omitted because this method does not require the  parameter during its implementation.

Relationship of logR values between different copy number states
The logR value is influenced by the tumor sample's purity and ploidy, posing a challenge in accurately determining diverse relative copy number states within heterogeneous CNA profiles.To address this issue, labelSeg employs a numerical model that establishes relationships between logR values corresponding to distinct copy number, accommodating varying levels of tumor purity and ploidy.We validated this relationship through the following model, and the results of this validation are presented in Supplementary Figure S2.For a given segment i with copy number   from sample j, characterized by tumor cell ploidy   and purity   , the corresponding logR value is computed as follows: Eq.S 1 In this model, the baseline logR value is always 0, although this is generally not the case in real scenarios.To account for this difference, labelSeg incorporates the distance between low/high copy changes and the estimated baseline as a representation of the genuine logR value arising from corresponding copy changes in practical implementation.

Special cases
In certain scenarios, labelSeg operates differently to address specific challenges and constraints.If segment profiles are excessively fragmented and no individual segment covers more than 20% of a chromosome, labelSeg will issue a warning and terminate its operation.In such cases, users may merge segments prior to applying the algorithm.
In instances where the segment profiles are heavily influenced by noise, labelSeg may struggle to accurately identify the baseline cluster.Should this occur, the feature value of the baseline cluster is set to 0 and utilized in subsequent calculations.
When long segment clustering fails to identify any low-level target cluster, e.g. the absence of broad low-level SCNAs in the analyzed sample, labelSeg calculates low-level calling thresholds from short segment clustering using the same method but with a more lenient boundary.
In situations where both long and short segment clustering do not yield low-level target clusters, indicating the absence of both broad and focal low-level SCNAs, labelSeg employs fixed distances of 0.6 and 1 from candidate clusters to the baseline cluster to pinpoint target high-level duplication and deletion clusters, respectively.

Synthetic datasets
The generation of simulated data was based on Willenbrock and Fridlyand's work with a slight modification [1].
For each sample, ploidy was assumed to be 2, and the true copy number (CN) of each segment was drawn from a fixed state probability {0.05,0.15,0.55,0.15,0.05,0.05}when CN= {0,1,2,3,4,5}.The lengths of segments with different copy number states were assigned by sampling from the empirical length distribution, which was established by binning logR values of the combined segment dataset (Supplementary Figure S1).The intervals were less than -1.2 (0 copies), between -1.2 and -0.3 (one copy), between -0.3 and 0.3 (two copies), between 0.3 and 0.7 (three copies), between 0.7 and 1.2 (four copies) and greater than 1.2 (five copies).The intervals were constructed to simulate the real segment length distribution rather than to indicate the real logR-copy number relationship.Unlike the original method, the length of segments with CN = 1,2,3 was scaled by 2.5,3,2.5 respectively.This modification accounts for the prevalence of normal segments or SCNAs with low copy number alterations spanning large contiguous genomic regions.Gaussian noise was applied with a variable standard deviation depending on the segment length.Let   denote the proportion of tumor cells, namely sample purity, in one sample j,   represent the length of a segment i,   be the copy number of a segment i,   2 (  ) denote the 50 ℎ percentile of the length of segments with the same copy number as   , and   3 (  ) denote the 75 ℎ percentile of the length of segments with the same copy number as   .The final logR value for segment i in sample j was computed as follows: where, We simulated various scenarios, including distinct tumor sample purities (0.5, 0.7, and 0.95) and a highly noisy situation where   was three times the original value indicated in Eq.S3.Each scenario comprised 500 samples, each containing 200 segments.To compare with labelSeg outputs, true copy numbers were transformed to true relative copy number state labels.

Real datasets
The TCGA segment datasets were acquired using TCGAbiolinks R package [2].These datasets consist of masked copy number segment profiles derived from primary tumors and generated through the Affymetrix SNP 6.0 platform.After randomly removing duplicate samples from the same patient, the TCGA-GBM dataset comprised 596 samples from patients with glioblastoma multiforme (GBM), the TCGA-LUSC dataset contained 503 samples from patients with lung squamous cell carcinoma (LUSC), and the TCGA-LAML dataset contained 194 samples from patients with acute myeloid leukemia (LAML).The gene-level ASCAT results were downloaded from the GDC data portal.The absolute copy numbers estimated by ASCAT were then transformed into relative copy number state labels.Ploidy was defined as the absolute copy number spanning most regions of the genome.
GISTIC outputs the same CNA state labels as labelSeg that indicate relative copy number states.We used available gene-level GISTIC calls of TCGA datasets from cBioportal database [3,4] for comparison.Except for the existing gene-level calls from GISTIC and ASCAT, other methods, including labelSeg, produced labels for each segment.These segment-level labels were subsequently transformed into gene-level labels.The assignment of gene-level labels was based on the copy number state of the overlapping segment with the maximal severity, as proposed in a previous study [5].
The MSK-IMPACT segment dataset was acquired from the cBioportal database (search keyword: msk mind, nature cancer 2022).This dataset comprises 247 samples from patients with advanced non-small cell lung cancer (NSCLC).Since this dataset includes gene-level CNA calls generated by their in-house pipeline, we utilized this gene-level data as a reference instead of ASCAT calls.It's important to note that the amplification and deletion in the reference data are not identical but similar with high-level duplication and deletion, as discussed in this manuscript.Therefore, our comparison focused on the alignment in high-level SCNAs for this dataset.The GISTIC results in this dataset were generated using the GISTIC 2.0 module on the GenePattern [6] website (version 6.15.30) with default parameters.

Section3 Data collection
The collection of TCGA segment datasets was described in Section 2.2.The Progenetix segment datasets were collected from the arrayMap cohort of Progenetix database.Both TCGA and Progenetix data were generated from microarrays, and segments with fewer than 10 probes were excluded.For the CCLE segment datasets, we downloaded them from the depmap data portal 22Q4 release, and duplicate profiles from the same cell line model were excluded after manual quality check.The CPTAC segment datasets were collected from two resources.Glioblastoma segment profiles were downloaded from cBioportal, and lung squamous cell carcinoma segment profiles were downloaded from GDC data portal.The lung squamous cell carcinoma segment profiles were AscatNGS calls [7] with absolute copy numbers instead of logR values, and we transformed them into SCNA labels.The genome version of these data is hg38, except for the CPTAC glioblastoma data, which were converted from hg19 to hg38 using segment_liftover Python package [8].The mRNA expression data of matched TCGA samples were obtained via TCGAbiolinks R package.

2 )
, the value of   −  will undergo an adjustment and be set to |  −  −   | 10 FIGNL1